Alternative splicing analysis benchmark with DICAST

Abstract Alternative splicing is a major contributor to transcriptome and proteome diversity in health and disease. A plethora of tools have been developed for studying alternative splicing in RNA-seq data. Previous benchmarks focused on isoform quantification and mapping. They neglected event detection tools, which arguably provide the most detailed insights into the alternative splicing process. DICAST offers a modular and extensible framework for analysing alternative splicing integrating eleven splice-aware mapping and eight event detection tools. We benchmark all tools extensively on simulated as well as whole blood RNA-seq data. STAR and HISAT2 demonstrated the best balance between performance and run time. The performance of event detection tools varies widely with no tool outperforming all others. DICAST allows researchers to employ a consensus approach to consider the most successful tools jointly for robust event detection. Furthermore, we propose the first reporting standard to unify existing formats and to guide future tool development.


INTRODUCTION
Alternati v e splicing (AS) affects around 95% of eukaryotic genes with multiple exons ( 1 , 2 ) and gi v es rise to a large number of isoforms. AS is involved in cellular processes and disease de v elopment (see r ecent r e vie ws on cancer ( 3 ), muscle ( 4 ) and neuron ( 5 ) de v elopment). The most popular tech-nology to study AS is short-read RNA sequencing (RNA-Seq). Possibilities for AS analysis from short-read RNA-Seq data comprise splice-aware mapping, de novo transcriptome assembly, AS detection and / or quantification on the e xon, isoform, or e v ent le v el as well as differential splicing analysis. Each year, se v eral ne w tools are pub lished for each of these analysis types.
Existing benchmark studies that could guide users on which tool to use for which analysis have several limitations (6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16). First, the studies use as ground truth either only a small subset of well-studied genes or simulated RNA-Seq data with randomly introduced AS e v ents and limited e v ent types. Second, they focused on AS detection tools that operate on the isoform le v el (6)(7)(8)(9)(10)(11), splice-aware mapping tools ( 12 , 13 ), and differential splicing analysis tools (14)(15)(16). Howe v er, AS detection tools that operate on the e v ent le v el ar e mor e pr ecise. Finally, benchmark studies ar e seldomly updated.
To perform a standar dised benchmar k, we created a modular pipeline called DICAST (Docker Integration for Comparison of AS tools). It uses Docker to handle the installation process and Snak emak e to handle AS analysis and the benchmark workflow. Its modularity allows for adding new tools in the future and for quickly updating the benchmark. As ground truth, we used RNA-Seq data sets that were simulated genome-wide and contained a predefined number and distribution of AS e v ents. The simulation with ASimulatoR ( 17 ) allowed us to evaluate the challenges and limitations of the tested tools systematically across data sets with varying le v els of difficulty.
While our main focus was to benchmark AS event detection tools, we started from splice-aware mapping to provide the best possible input to those tools. Thus, we evaluated 11 splice-aware mapping tools in addition to eight AS detection tools. We demonstrate the benefits and limitations of AS e v ent detection tools and suggest a putati v ely optimal strategy for comprehensi v e AS analysis.

DICAST
DICAST uses Docker for full reproducibility and to simplify deployment ( 18 , 19 ). Each docker image in DICAST can be used individually or as part of the benchmark / AS analysis pipeline. DICAST orchestrates the pipeline using Snak emak e ( 20 ). The DI-CAST source code and documentation are available at https: // github.com / CGAT-Group / DICAST and https://dicast.readthedocs .io/en/master/contents .html , respecti v ely.

Benchmark workflow
Supplementary Figure S1 illustrates the workflow for benchmarking AS analysis tools. RNA-Seq data sets were simulated using the R package ASimulatoR ( 17 ) with a sequencing depth of 200 million reads (read length 76 bp) based on the human genome hg38 and the Ensembl genome annotation, version 99 ( 21 ). We limited the simulation to chromosomes 1-22, X, Y and MT.
We tuned the complexity of the simulated data sets changing AS e v ents distribution, combinations, and sequencing error rate (Table 1 ). We started from the data set S0 with equal proportions of the four main types of AS e v ents and 0% sequencing error rate. Using this data set, we filtered out tools that demonstrate poor performance or long run time. Next, we added MES, ALE, and AFE to simula te da ta set S1 and to evaluate the impact of complex AS e v ents on tool performance. For data set S2, we raised the sequencing error rate up to 0.1% and investigated the impact of not-perfect sequencing. Data sets S0-S2 contain only one e v ent per transcript which is an artificial setting helping us to explore performance under simpler and wellcontrolled conditions. As all AS e v ent types combinations are possible, we evaluated how the AS e v ent types combinations impact the perf ormance: f or data set S3 we allowed two types of e v ents to occur within one transcript but with dif ferent exons; for da ta set S4 w e allow ed two types of e v ents to occur with the same exon. We also aimed to evaluate AS analysis tools in a biologically relevant setting but the real RNA-Seq data sets lack the genome-wide ground truth. We addressed this challenge and simulated a data set S5 with realistic e v ent proportions, which were obtained by analyzing 117 RNA-Seq samples from the SHIP cohort (the Study of Health in Pomerania) ( 22 ).
For each simulated data set, ASimulatoR estimated how many genes have enough exons to create AS events and can, hence, be used for the simulation. For the simulated set S3, this number was the lowest --37 648. We used this number of genes as a parameter for all simulated data sets to preserve the read coverage level.
We used seqtk (available at https://github.com/lh3/seqtk ) to downsample fastq files uniformly at random to 10 million reads to benchmark splicing-aware mapping tools. We then mapped these data sets to the human genome using 11 splice-aware mapping tools and calculated the proportion of unmapped reads relati v e to the number of all simulated reads ('fraction of unmapped reads') and the proportion of correctl y ma pped r eads and junctions r elati v e to the number of all mapped r eads ('pr ecision'). Pr ecision and fraction of unmapped reads for splice-aware mapping tools were calculated based on the resulting alignment file and the correct coordinates provided by ASimulatoR.
We chose the mapping tool with the best balance between precision and fraction of unmapped reads values and used it to produce alignments for 8 AS detection tools. We analyzed fastq files with 200 million reads and additionaly used seqtk to downsample them to 50 and 100 million reads. We then calculated the proportion of correctly found e v ents relati v e to the number of e v ents in the simulated set ('recall') and relati v e to the number of e v ents found by the tool ('precision'). Pr ecision and r ecall for AS e v ent detection tools were calculated based on the unified output of DICAST and the description of correct e v ents provided by ASimulatoR.
We estimated the variability in performance of splicingaware mapping tools and AS detection tools by simulating fiv e samples for the most simple data set (S0) and fiv e samples for the biolo gicall y-relevant data set (S5).

SHIP cohort
The Study of Health in Pomerania (SHIP) is a longitudinal population-based cohort study located in the area of West Pomerania (Northeast Germany). For RNA-Seq analysis, total RNA was extracted from whole blood with a mean RNA integrity of 8.5. Based on 500ng total RNA per sample, a library was pr epar ed using the TruSeq Stranded mRNA kits (A and B) with 24 barcodes and 117 samples were sequenced on Illumina HiSeq 4000, 2 × 101 bp pair ed-end r eads with a sequencing depth 40 million clusters per sample. Written informed consent was obtained from SHIP-TREND study participants, and all protocols were approved by the institutional ethical re vie w committee in adherence with the Declaration of Helsinki. We used these 117 RNA-Seq samples from the SHIP-TREND cohort ( 22 ) in this study. They were mapped to the human genome hg38 with STAR using the Ensembl genome annotation, version 99, and analyzed further with MAJIQ ( 23 ). Custom scripts in Python 3 were used to obtain the number of AS e v ents and genes with AS.

Tools for benchmark
We collected splice-aware mapping and AS e v ent detection tools released between 2010 and 2021 based on a Pubmed and Google Scholar search with the following inclusion criteria. A tool should be (i) available, (ii) documented, (iii) under an open-source license, (iv) available as standalone software (w e b-services can usually not process a large amount of data), (v) used in project(s) other than described in the tool publication, (vi) able to process widely used file formats such as fasta / fastq, gtf / gff3, and bam / sam. Since we controlled the number of AS e v ents in a custom genome annotation, a tool should be (vii) able to work with those as well. The final list of tools contains 11 splice-aware mapping tools and 8 AS detection tools ( Table 2 ).

Alternative splicing analysis and benchmark pipeline
A unified output f ormat f or AS event detection. The main challenge for comparing AS e v ent detection tools is the lack of standard output format. For an exon skipping e v ent, AS-GAL reports the coordinates of neighboring exons; ASpli reports the coordinates of a skipped exon itself; MAJIQ reports the coordinates of a neighboring exon and its junctions. We propose a unified f ormat f or all AS e v ent types that reports the coordinates of (i) skipped exons for exon skipping, multiple exon skipping, m utuall y e xclusi v e e xons; (ii) a retained intron for intron retention; (iii) an alternati v e part of an exon for alternative splice sites (Supplementary Figure S2). Additionally, for each e v ent, the unified output format contains the gene name , chromosome , strand as well as a unique ID.
DICAST: Doc k er Integrated Comparison of Alternative Splicing Tools. The next necessity for the standardization of the AS analysis and benchmark process is a unified pipeline. DICAST handles (i) the installation process for e v ery tool using Docker; (ii) the e xecution of tools using Snak emak e; (iii) the output forma t unifica tion; (iv) the comparison of AS e v ents detected by different tools.
The general workflow starts with short-read RNA-Seq data (optionall y sim ulated by ASimulatoR ( 17 ), Figure 1 ) that serves as input for splice-aware mapping tools. The resulting alignment files then serve as input for AS event detection tools. Ne xt, DICAST conv erts the output files of the AS e v ent detection tools to the unified format described above, compares detected events across the tools, and reports the results of the comparison. A user could run all or separate steps of the workflow with their RNA-Seq reads, alignment and genome annotation files.

Benchmark results: splice-aware mapping tools
Splice-aware mapping tools differ in the mapping approach and their use of genome annotations. Most tools use variations of the seed-and-extend algorithm and start with aligning parts of a read (seeds). ContextMap and MapSplice2 first align reads end-to-end and use the seed-and-extend algorithm for reads that could not be aligned in the first step. Some tools can use genome annotations for additional information (e.g., splice sites) ( Table 2 ), while others (e.g. BBmap) do not need it.
We provided a genome annotation where possible and used the best mode for AS analysis (e.g. a 2-pass mode for STAR). We downsampled the input files to 10 million reads uniformly at random to reduce the analysis time. For data sets S0 and S5 we repeated the alignment with fiv e simulated datastets to evaluate the robustness of the performance.
The ma pping a pproach has a limited effect on the performance of the tool. ContextMap and MapSplice2 show comparable performance metrics as the tools with the best performance such as STAR and HISAT2 (Figure 2 , Supplementary Figure S3). Tools that use genome annotation generally show better performance. Surprisingly, the complexity of the data sets only marginally affects the performance of the tools. Only DART suffers from a decrease in precision with an increasing sequencing error rate. Spliceaware mapping tools are robust in terms of precision (Figure 2 B): the standard deviation is less than 0.02. Concerning the fraction of unmapped reads we observe two categories: BBMap, CRAC, DART, HISAT2 and STAR perform well with a low fraction of unmapped reads and a low standard deviation while others perform poorly in both metrics.
In summary, STAR and HISAT2 showed a balance between precision and fraction of unmapped reads compared to other tools. ContextMap demonstrated the best precision and could be considered for the analysis if the sequencing depth of samples is deep enough and a user can allow the loss of 10-15% of reads. Unfortunatel y, ContextMa p has long run time (see run time subsection below) which might be inconvenient when analyzing a large number of samples. We chose STAR for further analysis, as it is more widely used (28 433 Google Scholar citations by January 2023). We downsampled data sets to 200, 100 and 50 million reads uniformly at random to investigate the effect of sequencing depth on performance.

Alternative splicing detection tools
AS e v ent detection tools detect e v ents from genome annota tion (ASpli, SGSeq (annota ted transcripts), IRFinder), RNA-Seq alignment (EventPointer, SGSeq ( de novo ), MA-JIQ), or both, i.e. by augmenting the genome annotation through alignment (ASGAL, splAdder, Whippet). As most tools ( Table 2 ) do not detect m utuall y e xclusi v e e xons and multiple exon skipping, we treated such e v ents as e xon skipping e v ents. Figure 3 A (upper part) shows results for data set S0. After examining the results, we excluded EventPointer, SGSeq ( de novo ), and splAdder from the further detailed analysis as these tools demonstrate low recall (less than 10% of recov ered e v ents). We also e xcluded ASGAL because of the long run time (a genome-wide analysis of 50 million reads took around 3 days). Supplementary Figure S4 shows results for data sets S1-S4. The ranking of the tools differs onl y slightl y. Whippet and SGSeq Anno (annotated transcripts) show the best performance. For all tools in data sets S0-S4, recall depends on the sequencing depth, while precision does not.
MAJIQ was used to deri v e the proportion of AS e v ent types and combinations from the SHIP cohort (Supplementary Figure S5). We used the proportions as a parameter for ASimulatoR to simulate the biologically-inspired set S5. Note that only the proportion values, but not the detected AS e v ents were used, so MAJIQ and the other tools re-mained unaware of exact AS events in S5. We used MAJIQ here, since it does not depend on genome annotation and can e xtract e v ents directly from alignments. The results for data set S5 (Figure 3 A, lower part) demonstrate almost the same ranking of the tools.
Additionally, we investigated the ability of tools to detect e v ents de novo . For this purpose, we edited a genome annotation file used as input for AS detection tools. Genome annotation describes the exon composition for all transcripts simulated from a gene: main and alternati v es. We trunca ted annota tions and kept only the description of the main transcript for each gene. We denote this data set as S5-tr(uncated).
The tested tools differ significantly in their abilities to detect de novo e v ents based on truncated genome annotation. ASpli and SGSeq (annotated transcripts) did not detect any nov el e v ents. For other tools, we compared precision and recall for S5 and S5-tr for each AS e v ent type indi vidually (Figure 3 B). Whippet did not find any de novo intron retention e v ents. The recov ery ra te of alterna ti v e splice sites also dropped for all tools: for MAJIQ., precision decreased fr om ar ound 0.9 to 0.5; recall decreased from 0.4 to 0.1; for Whippet, pr ecision decr eased fr om ar ound 0.7 to 0.4; recall decreased from 0.8 to 0.2 (evaluated for 50M reads).
For data sets S0 and S5 we repeated the AS detection with the full genome annotation fiv e times to evaluate the robustness of the performance. All evaluated AS detection tools demonstrated robust performance (Figure 3 C). On data set S5 with the most complex combinations of AS events, the increasing sequencing depth leads to incr easing r ecall but slightly decreasing precision.
Notably, the results of different tools show comparably little overlap (Figure 4 ). Most events are detected exclusi v ely by one tool (e.g., Whippet), suggesting that an integrati v e approach is needed for the comprehensi v e analysis of AS e v ents from short-read RNA-Seq data: (i) detect known e v ents using tools based on genome annotation (e.g. ASpli, Whippet); (ii) detect intron retention de novo using IRFinder; (iii) detect de novo e v ents using Whippet and / or MAJIQ.

Run time
Using the data set S0, we estimated the run time of the tools tested on Intel ® Xeon ® Gold 6148 Processor with 27.5M Cache, 2.40 GHz ( Figure 5 ). For the mapping tools, we used the downsampled data set with 10 million reads from the benchmark. Most tools perform indexing and mapping within an hour. ContextMap is the slowest tool with a run time of around 11 h. minimap2 and BBmap are the fastest tools with a run time of only se v eral minutes. We also es-timated the run time of AS detection tools depending on the sequencing depth ( Figure 5 ). The run time of MAJIQ does not depend on the sequencing depth and is in the range of minutes. Most other tools run within se v eral hours, with splAdder taking up to 10 hours for 200 million reads. AS-GAL was only evaluated with 50 million r eads, alr eady running longer than 72 h to complete.

DISCUSSION
We investigated the performance of AS e v ent detection tools in a comprehensi v e benchmar k. We started with spliceaware mapping tools to obtain the best possible input for e v ent detection. Among splice-aware mapping tools, STAR and HISAT2 present the best balance between the precision value, fraction of unmapped reads values, and the run time, which is in concordance with previous findings ( 12 ).
Concerning AS e v ent detection, we still find much room for improvement since tools with high recall values (Whippet, SGSeq (annotated transcripts), ASpli) can not detect e v ents de novo . Vice versa, tools that can identify e v ents de novo demonstrate poor recall values. We suggest using a combination of existing tools for a comprehensi v e AS analysis on the e v ent le v el using short-read RNA-Seq data.
While we aimed for a comprehensi v e, objecti v e, and standar dized benchmar k, we faced some limitations. First, the simula ted da ta sets might not reach the same le v el of complexity as real biological data sets. We mitigated this by mimicking the proportions of AS e v ent types as observed in the SHIP cohort. Second, for run time r easons, we r efrained from a tool-specific parameter tuning and relied on    the default parameters, which should ideally lead to optimal results in typical settings. Additionally, we did not want to favor tools with higher flexibility simply due to an increased number of parameters to tune. Finally, we did not explore the performance of the machine and deep learning-based tools (e.g. DARTS ( 43 ), IRFinder-S ( 44 )). The simulated AS e v ents were added randomly and do not account for, e.g., regula tory sequences tha t are used in machine learning a pproaches. Finall y, our benchmark is based on a human genome annotation. The results might differ for organisms with dissimilar AS patterns (e.g. for plants). Howe v er, Baruzzo et al. ( 12 )showed that at least the performance of RN A-Seq ma pping tools does not differ dramatically between human and Plasmodium falciparum genomes.
Many challenges in AS analysis have yet to be addressed, including general difficulties for tool de v elopment such as the lack of ef ficient paralleliza tion and substantial run time. We found that the standard format for aligned reads --bam / sam / cram --differs between tools and might not be compatible with some AS event detection tools. For instance, onl y few ma pping tools add an XS tag that indicates the strand orientation of an intron which is needed by tools such as SGSeq. We gathered the information about tools compatibility her e: https://dicast.r eadthedocs.io/en/ master/tools/tools.html . This incompatibility limits the us-ability of many splice-aware alignment tools for AS analysis. While DICAST introduces a unifying standard for AS event reporting, AS e v ent detection tools utilize inherently different approaches and lead to inconsistent results. To mitigate this, DICAST allows users to execute any combination of tools and facilitates adding newly published tools. By standardizing the output of AS event detection tools, DICAST significantly simplifies downstream analysis. In summary, DICAST offers a unified interface for existing methods and boosts method de v elopment by offering an easily extensib le frame wor k for benchmar king of e xisting and nov el AS anal ysis a pproaches.

DA T A A V AILABILITY
Access to the SHIP data for r esear ch purposes may be requested at https://www.fvcm.med.uni-greifswald.de/ dd service/data use intro.php . The description of simulated data sets and the corresponding R scripts are available at https://doi.org/10.5281/zenodo.7573144 .